Text Mining with n-gram Variables


Similar resources

n-Gram-Based Text Compression

We propose an efficient method for compressing Vietnamese text using n-gram dictionaries. It has a significant compression ratio in comparison with those of state-of-the-art methods on the same dataset. Given a text, first, the proposed method splits it into n-grams and then encodes them based on n-gram dictionaries. In the encoding phase, we use a sliding window with a size that ranges from bi...

N-gram-based Text Attribution

Quantitative authorship attribution refers to the task of identifying the author of a text based on measurable features of the author’s style—a problem that has practical application in areas as diverse as literary scholarship, plagiarism detection, and criminal forensics. Attribution methods generally follow a generative approach, wherein a statistical “profile” is created for a set of candida...

N-Gram-Based Text Categorization

Text categorization is a fundamental task in document processing, allowing the automated handling of enormous streams of documents in electronic form. One difficulty in handling some classes of documents is the presence of different kinds of textual errors, such as spelling and grammatical errors in email, and character recognition errors in documents that come through OCR. Text categorization ...

Improved Text Generation Using N-gram Statistics

In Natural Language Generation (NLG) systems, a general-purpose surface realisation module will usually require the underlying application to provide highly detailed input knowledge about the target sentence. As an attempt to reduce some of this complexity, in this paper we follow a traditional approach to NLG and present a number of experiments involving the use of n-gram language models as an ...

Language Identification of Short Text Segments with N-gram Models

There are many accurate methods for language identification of long text samples, but identification of very short strings still presents a challenge. This paper studies a language identification task, in which the test samples have only 5–21 characters. We compare two distinct methods that are well suited for this task: a naive Bayes classifier based on character n-gram models, and the ranking...

Journal

Journal title: The Stata Journal: Promoting communications on statistics and Stata

Year: 2017

ISSN: 1536-867X,1536-8734

DOI: 10.1177/1536867x1801700406